Introduction

The goal of this project is to predict whether a savings customer will take out a credit. We have several sources of data, including savings account transactions, ZIP codes, geographical and transactional ATM information, and open data on crime and sociodemographic areas in Mexico.

In this document we present the analysis that led to the computation of four models. Banco Azteca (BAZ) has around 12 million savings customers and 800 thousand customers with both credit and savings, from which we have a sample of 1 million savings customers. The analysis in this document is based on this sample and on the whole population of credit customers.

Variable Analysis

Transactions and Amount

quantile  abonos  abonos_monto  retiros  retiros_monto  num_meses  tiempo_meses       freq
0%             0         0.000        0   -39804335.20          1             1  0.0312500
5%             1         1.000        0     -135549.10          1             6  0.0526316
10%            1        50.000        0      -80000.00          1             8  0.0714286
15%            1       150.000        1      -55267.97          2            11  0.0967742
20%            1       554.000        1      -40600.00          2            13  0.1250000
25%            2      1212.975        2      -30724.68          3            14  0.1428571
30%            2      2075.000        2      -23850.00          3            16  0.1666667
35%            3      3200.000        3      -18616.00          4            18  0.2000000
40%            3      4636.000        4      -14750.00          4            19  0.2222222
45%            4      6200.000        5      -11500.00          5            21  0.2500000
50%            5      8350.000        7       -9006.00          5            23  0.2812500
55%            6     10800.000        8       -6940.00          6            25  0.3125000
60%            8     14000.000       10       -5170.00          7            27  0.3333333
65%            9     17850.000       12       -3897.79          8            29  0.3600000
70%           11     22917.640       15       -2650.00          9            30  0.3750000
75%           14     29950.000       19       -1650.00         10            31  0.3846154
80%           18     39550.000       25        -900.00         10            32  0.4210526
85%           24     54000.000       32        -300.00         11            32  0.5000000
90%           33     78500.000       45           0.00         12            32  0.5882353
95%           53    132810.063       72           0.00         12            32  0.7500000
100%       60199  42131739.960    22009           0.00         12            32  1.0000000

Gender

Tenure

Salary based on transactions

Number of months / Total months: a value between 0 and 1. A value of 1 means that the customer was active in every month available in the data; a value of 0 would mean that no activity took place. The value 0 cannot occur in this database because every customer has at least one transaction.
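As a minimal sketch of this ratio, the snippet below counts the distinct calendar months in which a customer has at least one transaction and divides by the number of months available. The function name, the input layout and the example dates are assumptions made for illustration; they are not taken from the project code.

```python
from datetime import date


def activity_frequency(transaction_dates, total_months):
    """Share of available months in which the customer had any activity."""
    # A month counts once, no matter how many transactions fall in it.
    active_months = {(d.year, d.month) for d in transaction_dates}
    return len(active_months) / total_months


# Hypothetical customer: three transactions in two distinct months,
# observed over a 12-month window.
dates = [date(2015, 1, 5), date(2015, 1, 20), date(2015, 3, 2)]
freq = activity_frequency(dates, total_months=12)
print(round(freq, 4))
```

Because every customer in the data has at least one transaction, the smallest possible value is 1 divided by the number of available months, never 0.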

credit_or_savings  electronic_banking  proportion
credit                              0   0.9509844
credit                              1   0.0490156
savings                             0   0.9553800
savings                             1   0.0446190

credit_or_savings  active_electronic_banking  proportion
credit                                     0   0.9686378
credit                                     1   0.0313622
savings                                    0   0.9772680
savings                                    1   0.0227310

Months with more activity

Histogram of date of first usage of credit

It can be seen that the number of people taking credits has been decreasing.

Geography

We have information about the customers’ ZIP codes. Combined with publicly available information from sources like INEGI, it could be used to estimate the socioeconomic level of each savings customer.

Available sources:

AGEB stands for Área GeoEstadística Básica (Basic Geostatistical Area), and a locality is a general term used by CONAPO to define several AGEBs.

This document uses information from the socioeconomic regions defined by INEGI.

ZIP code geographical information is available. According to the official postal code webpage, there are 32,448 different ZIP codes in Mexico, of which around 25,000 are available as shapefiles. The official ZIP code shapefiles are published on the government’s open data webpage, but not all of them are available yet; the Mexican postal service is still working on finding the boundaries of each code. Other resources are available, for example a non-official collection of shapefiles of neighborhoods and ZIP codes. In addition, Google’s geocoding API is a useful tool, used as a last resort to find information about some ZIP codes.

Even with all this available information there is still a problem: several ZIP codes aren’t officially assigned to any human settlement but are still used by people due to tradition or misinformation. So geographic information may not be available for all customers, but it will be for most of them.

Problem:

The polygons defining the ZIP codes aren’t equivalent to the polygons defining the AGEBs, so a mapping between them is needed to be able to use the publicly available information. Perhaps the simplest solution is to find the centroid of each ZIP code and of each AGEB, and then map each ZIP code to the AGEB with the closest centroid.

AGEB classification:

We have a classification for each AGEB that aims to show the differences among AGEBs based on indicators related to housing, education, health and employment, built from the last population census. Each AGEB is classified into one of 7 strata, such that stratum 7 contains the AGEBs with the most favorable average conditions and stratum 1 contains the AGEBs with the least favorable average conditions.

In the next images, maps of Mexico City and surroundings, Monterrey and Guadalajara are shown.

Map of Mexico City with centroids of each polygon:

Now, same map for Guadalajara, Jalisco:

And finally, for Monterrey, Nuevo León:

ZIP code information with their centroids can be seen in the next map of Mexico City:

ZIP code information with their centroids can be seen in the next map of Guadalajara. Some of the centroids may not match the plotted polygon perfectly because the database treats the ZIP code and the identifier as different groups.

ZIP code information with their centroids can be seen in the next map of Monterrey:

Finally, plotting the centroids of AGEBs and ZIP codes in Mexico City together we get:

Guadalajara:

Monterrey:

So, for each available ZIP code, the closest AGEB centroid is found and a mapping is made to assign an AGEB to each ZIP code, such that we get a table in the following format:

ZIP      ZIP long   ZIP lat  Nearest AGEB   AGEB long  AGEB lat  Distance (km)  Classification
56364   -98.93143  19.44496  1.503100e+12   -98.93469  19.44372      0.3680725               3
56367   -98.95076  19.44106  1.503100e+12   -98.94869  19.43894      0.3201608               4
56365   -98.94247  19.43852  1.503100e+12   -98.94134  19.43799      0.1325068               4
96340   -94.60759  18.00084  3.004801e+12   -94.60721  18.00117      0.0547658               6
42850   -99.33818  19.92243  1.306300e+12   -99.33511  19.91824      0.5655460               6
57850   -98.97560  19.38088  1.505800e+12   -98.97690  19.38002      0.1661747               6
97300   -89.70512  21.01598  3.110000e+12   -89.74094  21.02427      3.8302693               2
61531  -100.37365  19.42391  1.611200e+12  -100.37496  19.42216      0.2384809               4
41706   -98.41225  16.69447  1.204600e+12   -98.40838  16.69271      0.4568835               4
53750   -99.24115  19.45593  1.505700e+12   -99.24014  19.45617      0.1088229               6
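The nearest-centroid lookup behind a table like this one can be sketched as follows. This is only an illustration under stated assumptions: the great-circle (haversine) distance and the Earth radius of 6371 km are assumptions, since the document does not say which distance formula was used, and the AGEB identifiers are hypothetical labels. The coordinates are taken from three rows of the table.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_ageb(zip_centroid, ageb_centroids):
    """Return (ageb_id, distance_km) for the AGEB centroid closest to a ZIP centroid."""
    lon, lat = zip_centroid
    best_id, best_dist = None, math.inf
    for ageb_id, (a_lon, a_lat) in ageb_centroids.items():
        d = haversine_km(lon, lat, a_lon, a_lat)
        if d < best_dist:
            best_id, best_dist = ageb_id, d
    return best_id, best_dist


# Hypothetical AGEB labels; (longitude, latitude) pairs come from the table.
ageb_centroids = {
    "ageb_a": (-98.94134, 19.43799),
    "ageb_b": (-98.94869, 19.43894),
    "ageb_c": (-98.93469, 19.44372),
}

# Centroid of ZIP code 56365, also from the table.
ageb_id, dist_km = nearest_ageb((-98.94247, 19.43852), ageb_centroids)
print(ageb_id, round(dist_km, 3))
```

A linear scan like this is enough for a sketch; with tens of thousands of ZIP codes and AGEBs, a spatial index would reduce the number of distance evaluations.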

In the following graph, a histogram shows the distribution of the distance between the centroid of each ZIP code and the centroid of its assigned AGEB. The red lines represent quantiles 0.5, 0.75, 0.9 and 0.95. As can be seen, most of the mass is concentrated in distances shorter than 10 km. This may seem small, but within a city the landscape can change dramatically over 10 km.

In the following graph, the distance histogram is plotted once more, but with a separate panel depending on whether the ZIP code is in a rural, urban, semiurban or unknown type of area. In the urban and semiurban areas, more than 95% of ZIP codes are within 2.5 km of the closest centroid. The rural areas are the ones with the longer tail, which seems reasonable because rural areas are usually larger and AGEB information is scarce in these areas.

The following graph shows the distribution of the distance for the 4 main states in Mexico.

The next graph combines the data of the last two graphs: it shows the distance distribution depending on whether the area is rural, urban, semiurban or unknown and on whether the ZIP code is in any of the 4 biggest states in Mexico. Once more, in the urban and semiurban areas the distance is smaller than in rural areas.

This approach may fail in rural areas. Also, as can be noted, ZIP code polygons are generally larger in area than AGEBs, so the heterogeneity within each ZIP code is being ignored.

Customer analysis

First, let’s look at the distribution of the AGEB classification across the country. Remember that 7 means the AGEB is “good” on average and 1 means it is “bad”.

And now, the mapping of the ZIP codes:

The distribution changed considerably. As we can see in the following graph, the original AGEBs were both urban (U) and rural (R), but the mapping consists of only urban ZIP codes, which may be one reason why the distribution changed so much.

And now let’s analyze the sample with 1 million savings customers and circa 800 thousand credit customers.

Out of the 1,859,441 customers, we have the ZIP code mapping for 1,590,674, which are distributed in the following way:

And now, conditioning on whether it’s a credit or savings customer:

Crime Rate

Using information about crime reports we create four indexes that together give us a picture of the crime in the region. The indexes that we produce are:

Models

To build the models, some variables were computed from the transactions people have in their savings account. These variables aim to reflect some kind of economic stability in their accounts, and were computed under the assumption that behavior prior to taking a credit differs from behavior the rest of the time. To capture this idea, all of them were computed over different periods prior to the date on which a credit was taken; for customers that don’t have any credit, the last transaction date was used.

Also, the geographic variables were included. So, the variables computed were:

  • Number of deposits 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Number of withdrawals 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Number of overall transactions 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Sum of all deposits 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Sum of all withdrawals 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Maximum deposit 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Minimum deposit 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Maximum withdrawal 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Minimum withdrawal 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Median of deposits 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Median of withdrawals 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Number of days before the maximum amount was deposited, filtering by 1, 3 and 6 months before the credit was taken (or last available transaction)
  • Number of days before the minimum amount was deposited, filtering by 1, 3 and 6 months before the credit was taken (or last available transaction)
  • Ratio of the maximum withdrawal and the median of withdrawals 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Ratio of the maximum deposit and the median of deposits 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Ratio of the maximum withdrawal and the overall median 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Ratio of the maximum deposit and the overall median 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Ratio of median of deposits and median of withdrawals 1, 3, 6 and 12 months before the credit was taken (or last available transaction)
  • Ratio of the median of deposits of the last 3 months and the median of the deposits of the last 6 months
  • Ratio of the median of withdrawals of the last 3 months and the median of the withdrawals of the last 6 months
  • Ratio of the overall median of the last 3 months and the overall median of the last 6 months
  • Crime index (overall crime, non-violent crime, violent crime, kidnapping)
  • Classification of the AGEB linked to the ZIP code of the person
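To make the windowed transaction variables concrete, here is a minimal sketch that computes a few of them for one customer. The function name, the feature names, the 30-day month and the example data are assumptions made for illustration; deposits are taken as positive amounts and withdrawals as negative ones, following the sign of the amounts in the quantile table.

```python
from datetime import date, timedelta
from statistics import median


def window_features(transactions, reference_date, months):
    """Summarise the transactions in the `months` 30-day periods before reference_date.

    transactions: list of (date, amount) pairs, deposits positive and
    withdrawals negative. reference_date is the date the credit was taken,
    or the last available transaction for customers without credit.
    """
    start = reference_date - timedelta(days=30 * months)
    amounts = [a for d, a in transactions if start <= d < reference_date]
    deposits = [a for a in amounts if a > 0]
    withdrawals = [-a for a in amounts if a < 0]  # kept as positive magnitudes

    def safe_ratio(num, den):
        # A ratio is undefined when a side is missing or the denominator is 0.
        return num / den if num is not None and den else None

    max_dep = max(deposits) if deposits else None
    med_dep = median(deposits) if deposits else None
    med_wd = median(withdrawals) if withdrawals else None
    return {
        "n_deposits": len(deposits),
        "n_withdrawals": len(withdrawals),
        "n_transactions": len(amounts),
        "sum_deposits": sum(deposits),
        "sum_withdrawals": sum(withdrawals),
        "max_deposit": max_dep,
        "median_deposit": med_dep,
        "median_withdrawal": med_wd,
        "max_to_median_deposit": safe_ratio(max_dep, med_dep),
        "median_deposit_to_withdrawal": safe_ratio(med_dep, med_wd),
    }


# Hypothetical customer with four transactions.
transactions = [
    (date(2014, 12, 15), 3000.0),
    (date(2015, 3, 10), 500.0),
    (date(2015, 5, 20), 1000.0),
    (date(2015, 5, 25), -200.0),
]
reference_date = date(2015, 6, 1)
features = {m: window_features(transactions, reference_date, m) for m in (1, 3, 6)}
print(features[6]["max_to_median_deposit"])
```

Returning None for undefined ratios keeps missing values explicit, so the choice of how to impute them is left to the modelling step.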

After this, a random forest was trained using all of these variables and the importance of each variable was computed, and the results were the following:

Variable Importance
Number of transactions in the last 30 days 16.020907
Number of days before minimum amount was deposited filtering by one month 12.153303
Number of transactions in the last 360 days 11.933149
Number of days before maximum amount was deposited filtering by one month 8.857325
Number of days before maximum amount was deposited filtering by one year 8.472480
Median of overall transactions in the last year 7.976154
Ratio of the median of deposits and median of withdrawals for the last 3 months 6.314436
Ratio of the maximum deposit and the median of deposits for the last year 6.278312
Number of days before minimum amount was deposited filtering by one year 6.185391
Number of withdrawals in the last year 5.785143
Median of deposits for the last year 5.757832
Number of transactions in the last 3 months 5.621182
Number of the deposits in the last year 5.423741
Maximum deposit in the last year 5.349114
Minimum deposit in the last six months 5.292315
Ratio of the median of the deposits and median of overall transactions 5.260750
Sum of the deposits in one year 5.253443
Ratio of maximum withdrawal and median of overall withdrawals 5.092131
Sum of withdrawals in one year 5.082976
Crime index of the area where the customer lives 5.059758

The following plots show the densities of each of these variables conditioned by the response variable (1: credit, 0: savings). The vertical lines are percentiles 50, 75, 90 and 95.

The models were trained with 74,384 people, of whom 18,746 took credits in the years 2014 and 2015; the rest only have savings accounts. The models were logistic regression, random forest, gradient boosting and support vector machines. The following tables show the results of each model:

Random Forest

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 16641   343
##          1    24  5307
##                                           
##                Accuracy : 0.9836          
##                  95% CI : (0.9818, 0.9852)
##     No Information Rate : 0.7468          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9557          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9986          
##             Specificity : 0.9393          
##          Pos Pred Value : 0.9798          
##          Neg Pred Value : 0.9955          
##              Prevalence : 0.7468          
##          Detection Rate : 0.7457          
##    Detection Prevalence : 0.7611          
##       Balanced Accuracy : 0.9689          
##                                           
##        'Positive' Class : 0               
## 

Gradient Boosting

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 16567  1119
##          1    98  4531
##                                           
##                Accuracy : 0.9455          
##                  95% CI : (0.9424, 0.9484)
##     No Information Rate : 0.7468          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8466          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9941          
##             Specificity : 0.8019          
##          Pos Pred Value : 0.9367          
##          Neg Pred Value : 0.9788          
##              Prevalence : 0.7468          
##          Detection Rate : 0.7424          
##    Detection Prevalence : 0.7926          
##       Balanced Accuracy : 0.8980          
##                                           
##        'Positive' Class : 0               
## 

Logistic Regression

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 16464  3589
##          1   201  2061
##                                           
##                Accuracy : 0.8302          
##                  95% CI : (0.8252, 0.8351)
##     No Information Rate : 0.7468          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4399          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9879          
##             Specificity : 0.3648          
##          Pos Pred Value : 0.8210          
##          Neg Pred Value : 0.9111          
##              Prevalence : 0.7468          
##          Detection Rate : 0.7378          
##    Detection Prevalence : 0.8986          
##       Balanced Accuracy : 0.6764          
##                                           
##        'Positive' Class : 0               
## 

SVM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 16407  1396
##          1   258  4254
##                                           
##                Accuracy : 0.9259          
##                  95% CI : (0.9224, 0.9293)
##     No Information Rate : 0.7468          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.79            
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9845          
##             Specificity : 0.7529          
##          Pos Pred Value : 0.9216          
##          Neg Pred Value : 0.9428          
##              Prevalence : 0.7468          
##          Detection Rate : 0.7352          
##    Detection Prevalence : 0.7978          
##       Balanced Accuracy : 0.8687          
##                                           
##        'Positive' Class : 0               
## 

As we can see, all four models perform well. The random forest has the best results, with 98.4% accuracy on the test sample: it correctly identifies 99.9% of the customers who only have a savings account and 93.9% of the customers who have a credit. These results look promising, but the tests should be repeated for the four models before concluding that a direct campaign would have the expected results.
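The summary statistics can be re-derived from the four cells of a confusion matrix. The sketch below does this for the random forest counts reported above, treating class 0 (savings only) as the positive class, as the output does. The function and key names are chosen here for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Basic rates from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # share of actual positives predicted positive
    specificity = tn / (tn + fp)  # share of actual negatives predicted negative
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Random forest counts, positive class = 0 (savings only):
# predicted 0 / actual 0 = 16641, predicted 0 / actual 1 = 343,
# predicted 1 / actual 0 = 24,    predicted 1 / actual 1 = 5307.
rf = binary_metrics(tp=16641, fp=343, fn=24, tn=5307)
for name, value in rf.items():
    print(f"{name}: {value:.4f}")
```

Since the positive class is the savings-only group, specificity is the rate that matters for a credit campaign: it is the share of actual credit customers that the model recovers.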